The goal of this workshop is to use a real-world data set (police stops in Minneapolis in 2017) to work through some of the Tidyverse’s more advanced functions. The workshop is geared toward users who have some familiarity with the Tidyverse’s main principles (tidy data, the pipe, ggplot2) but want to take their knowledge a step further to do more complex data manipulation, summaries and graphing. It’s basically an overview of tools that data scientists end up using frequently.
Learners in this workshop will work to:
1. Quickly summarize and get an overview of a new data set
2. Filter their data by columns and by certain conditions
3. Learn how to transform and add new columns to existing data sets
4. Understand powerful grouping techniques to apply transformations and summaries
5. Summarize their data multiple times to gain critical insights
6. Join relational data together
7. Generate plots by piping data directly into ggplot2
Generally, the format will be for me to live code and for all of you to follow along and work on your individual scripts, but there will be a few times during the workshop that I’ll pose a challenge. This is the time for you and your table to get together and try and solve a problem.
The first step is to snag the data we’re going to use for this workshop. It’s an interesting real-life data set: stop records from the Minneapolis police department in 2017. Let’s get it.
#Reading in from my github repo
library(tidyverse)
police_data = read_csv("https://tinyurl.com/y55y2hjk")
## Warning: Missing column names filled in: 'X1' [1]
## Warning: 19110 parsing failures.
## row col expected actual file
## 28573 citationIssued 1/0/T/F/TRUE/FALSE NO 'https://tinyurl.com/y55y2hjk'
## 28577 citationIssued 1/0/T/F/TRUE/FALSE NO 'https://tinyurl.com/y55y2hjk'
## 28589 citationIssued 1/0/T/F/TRUE/FALSE NO 'https://tinyurl.com/y55y2hjk'
## 28597 citationIssued 1/0/T/F/TRUE/FALSE NO 'https://tinyurl.com/y55y2hjk'
## 28599 citationIssued 1/0/T/F/TRUE/FALSE NO 'https://tinyurl.com/y55y2hjk'
## ..... .............. .................. ...... ..............................
## See problems(...) for more details.
#Get some parsing failures, but we can scope that out in a sec
#Using glimpse (a superior version of str())
glimpse(police_data)
## Observations: 51,920
## Variables: 15
## $ X1 <dbl> 6823, 6824, 6825, 6826, 6827, 6828, 6829, 6830, 6…
## $ idNum <chr> "17-000003", "17-000007", "17-000073", "17-000092…
## $ date <dttm> 2017-01-01 00:00:42, 2017-01-01 00:03:07, 2017-0…
## $ problem <chr> "suspicious", "suspicious", "traffic", "suspiciou…
## $ MDC <chr> "MDC", "MDC", "MDC", "MDC", "MDC", "MDC", "MDC", …
## $ citationIssued <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ personSearch <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "…
## $ vehicleSearch <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "…
## $ preRace <chr> "Unknown", "Unknown", "Unknown", "Unknown", "Unkn…
## $ race <chr> "Unknown", "Unknown", "White", "East African", "W…
## $ gender <chr> "Unknown", "Male", "Female", "Male", "Female", "M…
## $ lat <dbl> 44.96662, 44.98045, 44.94835, 44.94836, 44.97908,…
## $ long <dbl> -93.24646, -93.27134, -93.27538, -93.28135, -93.2…
## $ policePrecinct <dbl> 1, 1, 5, 5, 1, 1, 1, 2, 2, 4, 5, 1, 2, 1, 1, 1, 4…
## $ neighborhood <chr> "Cedar Riverside", "Downtown West", "Whittier", "…
#Overall this looks good - we can see we've got ~51k observations with 15 variables
#Let's scope out what's going on with those parsing failures....
problems(police_data)
## # A tibble: 19,110 x 5
## row col expected actual file
## <int> <chr> <chr> <chr> <chr>
## 1 28573 citationIssu… 1/0/T/F/TRUE/FAL… NO 'https://tinyurl.com/y55y2…
## 2 28577 citationIssu… 1/0/T/F/TRUE/FAL… NO 'https://tinyurl.com/y55y2…
## 3 28589 citationIssu… 1/0/T/F/TRUE/FAL… NO 'https://tinyurl.com/y55y2…
## 4 28597 citationIssu… 1/0/T/F/TRUE/FAL… NO 'https://tinyurl.com/y55y2…
## 5 28599 citationIssu… 1/0/T/F/TRUE/FAL… NO 'https://tinyurl.com/y55y2…
## 6 28601 citationIssu… 1/0/T/F/TRUE/FAL… NO 'https://tinyurl.com/y55y2…
## 7 28604 citationIssu… 1/0/T/F/TRUE/FAL… NO 'https://tinyurl.com/y55y2…
## 8 28605 citationIssu… 1/0/T/F/TRUE/FAL… NO 'https://tinyurl.com/y55y2…
## 9 28606 citationIssu… 1/0/T/F/TRUE/FAL… NO 'https://tinyurl.com/y55y2…
## 10 28607 citationIssu… 1/0/T/F/TRUE/FAL… YES 'https://tinyurl.com/y55y2…
## # … with 19,100 more rows
#Looks like the culprit is the citationIssued column - let's just reimport with
#an increased guess_max so the parser sees these values when guessing column types
address = "https://tinyurl.com/y55y2hjk"
police_data = read_csv(address, guess_max = 29000)
## Warning: Missing column names filled in: 'X1' [1]
glimpse(police_data)
## Observations: 51,920
## Variables: 15
## $ X1 <dbl> 6823, 6824, 6825, 6826, 6827, 6828, 6829, 6830, 6…
## $ idNum <chr> "17-000003", "17-000007", "17-000073", "17-000092…
## $ date <dttm> 2017-01-01 00:00:42, 2017-01-01 00:03:07, 2017-0…
## $ problem <chr> "suspicious", "suspicious", "traffic", "suspiciou…
## $ MDC <chr> "MDC", "MDC", "MDC", "MDC", "MDC", "MDC", "MDC", …
## $ citationIssued <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ personSearch <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "…
## $ vehicleSearch <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "…
## $ preRace <chr> "Unknown", "Unknown", "Unknown", "Unknown", "Unkn…
## $ race <chr> "Unknown", "Unknown", "White", "East African", "W…
## $ gender <chr> "Unknown", "Male", "Female", "Male", "Female", "M…
## $ lat <dbl> 44.96662, 44.98045, 44.94835, 44.94836, 44.97908,…
## $ long <dbl> -93.24646, -93.27134, -93.27538, -93.28135, -93.2…
## $ policePrecinct <dbl> 1, 1, 5, 5, 1, 1, 1, 2, 2, 4, 5, 1, 2, 1, 1, 1, 4…
## $ neighborhood <chr> "Cedar Riverside", "Downtown West", "Whittier", "…
#Let's check out that column
unique(police_data$citationIssued)
## [1] NA "NO" "YES"
Ok, so now that we have our data and have given it a cursory glance, I’m sure you can think of a million questions to ask about it, some of which we will get to in a bit. But first, let’s go over one of the fundamental tools of the tidyverse: filtering your data via the select() and filter() functions.
select() allows you to pull out certain columns of data, while filter() allows you select only certain rows (usually based on some set of conditions). Let’s start with select().
The first thing we might want to do is clean up our dataframe - it’s probably not necessary to have the X1 column (which is just an ID number and is replicated by the idNum column), and the MDC column isn’t a useful addition either.
#I usually start non-destructively - note lack of quotations and negative signs
police_data %>%
select(-MDC, -X1)
#We can also use select to change the order and rename
police_data %>%
select(citation_issued = citationIssued, lat, lon = long, date)
#And we also have helper functions if you have a boatload of columns
police_data %>%
select(contains("Search"), lat:neighborhood)
You can already see the power over base R - I’m not even sure how I would go about coding that same functionality with bracket or dollar-sign indexing… this is easy, fast and powerful.
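For the curious, here’s roughly what a base-R equivalent of that last select() would take - this is just a sketch on a toy data frame that reuses a few of police_data’s column names, and it works, but it’s clearly clunkier:

```r
# Base-R approximation of select(contains("Search"), lat:neighborhood)
df <- data.frame(personSearch = "NO", vehicleSearch = "NO",
                 lat = 44.9, long = -93.2,
                 policePrecinct = 1, neighborhood = "Whittier")

# contains("Search") becomes a grep over the column names
search_cols <- grep("Search", names(df), value = TRUE)

# lat:neighborhood becomes a positional range between two name lookups
range_cols <- names(df)[which(names(df) == "lat"):which(names(df) == "neighborhood")]

df[, c(search_cols, range_cols)]
```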
Let’s do some filtering.
Often, you may want to work with subsections of a dataframe for graphing, analysis or summarization. For instance, it might be useful to have only the portion of the data where citations were issued. This is easy using the filter() function.
police_data %>%
filter(citationIssued == "YES")
## # A tibble: 3,211 x 15
## X1 idNum date problem MDC citationIssued
## <dbl> <chr> <dttm> <chr> <chr> <chr>
## 1 36603 17-2… 2017-07-06 13:27:32 suspic… MDC YES
## 2 36615 17-2… 2017-07-06 14:34:37 traffic MDC YES
## 3 36620 17-2… 2017-07-06 15:09:29 traffic MDC YES
## 4 36626 17-2… 2017-07-06 15:58:31 traffic MDC YES
## 5 36636 17-2… 2017-07-06 17:20:22 suspic… MDC YES
## 6 36638 17-2… 2017-07-06 17:27:32 suspic… MDC YES
## 7 36659 17-2… 2017-07-06 18:45:15 traffic MDC YES
## 8 36665 17-2… 2017-07-06 19:47:12 traffic MDC YES
## 9 36677 17-2… 2017-07-06 21:12:10 suspic… MDC YES
## 10 36707 17-2… 2017-07-06 23:45:35 suspic… MDC YES
## # … with 3,201 more rows, and 9 more variables: personSearch <chr>,
## # vehicleSearch <chr>, preRace <chr>, race <chr>, gender <chr>,
## # lat <dbl>, long <dbl>, policePrecinct <dbl>, neighborhood <chr>
You can also combine filters using & or |.
police_data %>%
filter(citationIssued == "YES" & personSearch == "YES")
## # A tibble: 459 x 15
## X1 idNum date problem MDC citationIssued
## <dbl> <chr> <dttm> <chr> <chr> <chr>
## 1 36636 17-2… 2017-07-06 17:20:22 suspic… MDC YES
## 2 36665 17-2… 2017-07-06 19:47:12 traffic MDC YES
## 3 36773 17-2… 2017-07-07 10:55:30 traffic MDC YES
## 4 36860 17-2… 2017-07-07 20:17:01 suspic… MDC YES
## 5 36927 17-2… 2017-07-08 02:49:26 traffic MDC YES
## 6 36956 17-2… 2017-07-08 14:42:05 suspic… MDC YES
## 7 36972 17-2… 2017-07-08 19:04:43 suspic… MDC YES
## 8 37019 17-2… 2017-07-09 01:06:12 traffic MDC YES
## 9 37085 17-2… 2017-07-09 21:22:49 suspic… MDC YES
## 10 37168 17-2… 2017-07-10 12:45:39 traffic MDC YES
## # … with 449 more rows, and 9 more variables: personSearch <chr>,
## # vehicleSearch <chr>, preRace <chr>, race <chr>, gender <chr>,
## # lat <dbl>, long <dbl>, policePrecinct <dbl>, neighborhood <chr>
We can also do more complicated time-based filters with dplyr and lubridate.
For instance, maybe we just want to filter our data to look at records where citations were issued in November of 2017.
#loading lubridate
library(lubridate)
police_data %>%
filter(citationIssued == "YES" & lubridate::month(date) == 11)
## # A tibble: 537 x 15
## X1 idNum date problem MDC citationIssued
## <dbl> <chr> <dttm> <chr> <chr> <chr>
## 1 53626 17-4… 2017-11-01 09:45:06 suspic… MDC YES
## 2 53633 17-4… 2017-11-01 11:24:15 traffic MDC YES
## 3 53658 17-4… 2017-11-01 13:54:31 traffic MDC YES
## 4 53661 17-4… 2017-11-01 14:16:20 traffic MDC YES
## 5 53662 17-4… 2017-11-01 14:25:11 traffic MDC YES
## 6 53663 17-4… 2017-11-01 14:30:05 traffic MDC YES
## 7 53667 17-4… 2017-11-01 15:22:02 suspic… MDC YES
## 8 53668 17-4… 2017-11-01 15:40:35 suspic… MDC YES
## 9 53669 17-4… 2017-11-01 15:43:18 traffic MDC YES
## 10 53677 17-4… 2017-11-01 16:12:16 suspic… MDC YES
## # … with 527 more rows, and 9 more variables: personSearch <chr>,
## # vehicleSearch <chr>, preRace <chr>, race <chr>, gender <chr>,
## # lat <dbl>, long <dbl>, policePrecinct <dbl>, neighborhood <chr>
You can also filter by group_by (which we’ll get to in a sec). Some useful functions include:
1. ==, >, >=
2. is.na()
3. !
4. between(), near()
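Here’s a quick sketch of a few of those helpers on a made-up toy tibble (the toy columns and values are invented for illustration, not from the police data):

```r
library(dplyr)

toy <- tibble(id    = 1:5,
              hour  = c(1, 3, NA, 14, 23),
              speed = c(30.0001, 45, 60, 30, 55))

toy %>% filter(between(hour, 1, 4))          # rows with hour in [1, 4]; NA rows are dropped
toy %>% filter(!is.na(hour))                 # drop rows with a missing hour
toy %>% filter(near(speed, 30, tol = 0.01))  # floating-point-safe "equality"
```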
Using the police data above work with your group/partner to create a subset of the larger data set of traffic violation stops on black men between 1 and 4 am where a citation was issued. Include the date, problem, citationIssued, race, gender, lat and long columns.
The time problem is tricky if you’re not familiar with lubridate (and even if you are) - check out the cheat sheet and google!
police_data %>%
select(date, problem, citationIssued, race, gender, lat, long) %>%
filter(race == "Black" & gender == "Male" & citationIssued == "YES" & problem == "traffic") %>%
filter(between(hour(date), 1, 4))
## # A tibble: 82 x 7
## date problem citationIssued race gender lat long
## <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2017-07-08 02:49:26 traffic YES Black Male 44.9 -93.3
## 2 2017-07-11 02:04:15 traffic YES Black Male 45.0 -93.3
## 3 2017-07-11 02:46:18 traffic YES Black Male 45.0 -93.3
## 4 2017-07-14 02:31:29 traffic YES Black Male 45.0 -93.3
## 5 2017-07-15 01:31:15 traffic YES Black Male 45.0 -93.3
## 6 2017-07-17 01:34:50 traffic YES Black Male 45.0 -93.3
## 7 2017-07-17 02:44:55 traffic YES Black Male 45.0 -93.3
## 8 2017-07-22 01:06:41 traffic YES Black Male 45.0 -93.3
## 9 2017-07-23 01:25:27 traffic YES Black Male 45.0 -93.3
## 10 2017-07-27 01:24:26 traffic YES Black Male 45.0 -93.3
## # … with 72 more rows
Mutations are really just a mildly-confusing term for creating columns from row-by-row calculations (either making new columns or replacing old columns with new ones). mutate() is maybe my most-used function from the tidyverse - lots of utility!
I’d say it’s generally more applicable with continuous data, of which this data set is lacking, but there is still utility here. Let’s say you wanted to add a column that indicates whether or not the cop for a given record was successful at determining the race of the person being stopped. We can add a new column to the data easily!
police_data %>%
select(date, problem, preRace, race, gender, lat, long) %>% #slimming down the data a bit first so we can see things more clearly
mutate(race_correct = ifelse(preRace == race, TRUE, FALSE))
## # A tibble: 51,920 x 8
## date problem preRace race gender lat long
## <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2017-01-01 00:00:42 suspic… Unknown Unkn… Unkno… 45.0 -93.2
## 2 2017-01-01 00:03:07 suspic… Unknown Unkn… Male 45.0 -93.3
## 3 2017-01-01 00:23:15 traffic Unknown White Female 44.9 -93.3
## 4 2017-01-01 00:33:48 suspic… Unknown East… Male 44.9 -93.3
## 5 2017-01-01 00:37:58 traffic Unknown White Female 45.0 -93.3
## 6 2017-01-01 00:46:48 traffic Unknown East… Male 45.0 -93.3
## 7 2017-01-01 00:48:46 suspic… Unknown Black Male 45.0 -93.3
## 8 2017-01-01 00:50:55 traffic Unknown Other Female 45.0 -93.2
## 9 2017-01-01 00:57:10 traffic Unknown White Male 45.0 -93.3
## 10 2017-01-01 01:05:50 traffic Unknown Black Male 45.0 -93.3
## # … with 51,910 more rows, and 1 more variable: race_correct <lgl>
And here’s another more complicated example that adds a morning, afternoon or night designator.
police_data %>%
select(date, problem, preRace, race, gender, lat, long, policePrecinct) %>%
mutate(time_of_day = ifelse(between(hour(date), 5, 12), "Morning",
ifelse(between(hour(date), 13, 19), "Afternoon", "Night")))
## # A tibble: 51,920 x 9
## date problem preRace race gender lat long
## <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2017-01-01 00:00:42 suspic… Unknown Unkn… Unkno… 45.0 -93.2
## 2 2017-01-01 00:03:07 suspic… Unknown Unkn… Male 45.0 -93.3
## 3 2017-01-01 00:23:15 traffic Unknown White Female 44.9 -93.3
## 4 2017-01-01 00:33:48 suspic… Unknown East… Male 44.9 -93.3
## 5 2017-01-01 00:37:58 traffic Unknown White Female 45.0 -93.3
## 6 2017-01-01 00:46:48 traffic Unknown East… Male 45.0 -93.3
## 7 2017-01-01 00:48:46 suspic… Unknown Black Male 45.0 -93.3
## 8 2017-01-01 00:50:55 traffic Unknown Other Female 45.0 -93.2
## 9 2017-01-01 00:57:10 traffic Unknown White Male 45.0 -93.3
## 10 2017-01-01 01:05:50 traffic Unknown Black Male 45.0 -93.3
## # … with 51,910 more rows, and 2 more variables: policePrecinct <dbl>,
## # time_of_day <chr>
Lots of options for continuous data - think about functions like:
1. +, -, log()
2. lead(), lag()
3. percent_rank(), ntile()
4. cummean()
5. na_if()
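To make those concrete, here’s a sketch on a small invented tibble (the columns and values are made up for illustration):

```r
library(dplyr)

scores <- tibble(day = 1:5, n = c(10, 20, 20, 0, 40))

mutated <- scores %>%
  mutate(change   = n - lag(n),       # difference from the previous row
         run_mean = cummean(n),       # running average up to each row
         pct_rank = percent_rank(n),  # rank rescaled to [0, 1]
         n_clean  = na_if(n, 0))      # turn sentinel 0s into NA

mutated
```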
Replace the race column in the dataset with a column of the same name except that the value “Unknown” is replaced with NA.
police_data %>%
mutate(race = na_if(race, "Unknown"))
## # A tibble: 51,920 x 15
## X1 idNum date problem MDC citationIssued
## <dbl> <chr> <dttm> <chr> <chr> <chr>
## 1 6823 17-0… 2017-01-01 00:00:42 suspic… MDC <NA>
## 2 6824 17-0… 2017-01-01 00:03:07 suspic… MDC <NA>
## 3 6825 17-0… 2017-01-01 00:23:15 traffic MDC <NA>
## 4 6826 17-0… 2017-01-01 00:33:48 suspic… MDC <NA>
## 5 6827 17-0… 2017-01-01 00:37:58 traffic MDC <NA>
## 6 6828 17-0… 2017-01-01 00:46:48 traffic MDC <NA>
## 7 6829 17-0… 2017-01-01 00:48:46 suspic… MDC <NA>
## 8 6830 17-0… 2017-01-01 00:50:55 traffic MDC <NA>
## 9 6831 17-0… 2017-01-01 00:57:10 traffic MDC <NA>
## 10 6832 17-0… 2017-01-01 01:05:50 traffic MDC <NA>
## # … with 51,910 more rows, and 9 more variables: personSearch <chr>,
## # vehicleSearch <chr>, preRace <chr>, race <chr>, gender <chr>,
## # lat <dbl>, long <dbl>, policePrecinct <dbl>, neighborhood <chr>
One of the most powerful features of dplyr is the ability to do stuff on groups or chunks of your data. Sometimes you’ll want to leave the dataset intact, and other times you’ll want to summarize the data after you do these groupings. Let’s look at some examples of both.
First, let’s look at some relatively complicated manipulations that leave all the data in place… we’re going to calculate the average hour at which stops happen for each month, and then filter the data frame for only incidents that happen after that average.
police_data %>%
group_by(month(date)) %>%
mutate(mean_hour = mean(hour(date))) %>%
filter(hour(date) > mean_hour) %>%
tail()
## # A tibble: 6 x 17
## # Groups: month(date) [1]
## X1 idNum date problem MDC citationIssued personSearch
## <dbl> <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 60833 17-4… 2017-12-31 23:11:15 suspic… MDC NO NO
## 2 60834 17-4… 2017-12-31 23:15:50 traffic MDC YES NO
## 3 60835 17-4… 2017-12-31 23:18:32 suspic… MDC NO NO
## 4 60836 17-4… 2017-12-31 23:31:57 traffic MDC NO NO
## 5 60837 17-4… 2017-12-31 23:48:22 traffic MDC NO YES
## 6 60838 17-4… 2017-12-31 23:52:35 traffic MDC NO NO
## # … with 10 more variables: vehicleSearch <chr>, preRace <chr>,
## # race <chr>, gender <chr>, lat <dbl>, long <dbl>, policePrecinct <dbl>,
## # neighborhood <chr>, `month(date)` <dbl>, mean_hour <dbl>
Now let’s look at a summary - this is where stuff gets really useful. Let’s say that you wanted to compare the total number of incidents among white and black men and women.
police_data %>%
filter(race == "Black" | race == "White") %>%
group_by(race, gender) %>%
summarize(n = n())
## # A tibble: 8 x 3
## # Groups: race [2]
## race gender n
## <chr> <chr> <int>
## 1 Black Female 3510
## 2 Black Male 11630
## 3 Black Unknown 64
## 4 Black <NA> 16
## 5 White Female 4036
## 6 White Male 7635
## 7 White Unknown 20
## 8 White <NA> 12
#We could also do the same thing to include precinct to see if there are biases by gender or race by precinct
police_data %>%
filter(race == "Black" | race == "White") %>%
group_by(race, gender, policePrecinct) %>%
summarize(n = n())
## # A tibble: 38 x 4
## # Groups: race, gender [8]
## race gender policePrecinct n
## <chr> <chr> <dbl> <int>
## 1 Black Female 1 396
## 2 Black Female 2 602
## 3 Black Female 3 379
## 4 Black Female 4 1542
## 5 Black Female 5 591
## 6 Black Male 1 1769
## 7 Black Male 2 1619
## 8 Black Male 3 1660
## 9 Black Male 4 4605
## 10 Black Male 5 1977
## # … with 28 more rows
#A lot of data, probably better to visualize... we'll get to this later, but all of this can be piped directly into ggplot.
police_data %>%
filter(race == "Black" | race == "White") %>%
group_by(race, gender, policePrecinct) %>%
summarize(n = n()) %>%
ggplot(aes(x = race, fill = factor(gender), y = n)) +
geom_col() +
facet_wrap(~ policePrecinct)
Let’s address another question - which are the top 5 neighborhoods with the most incidents?
police_data %>%
group_by(neighborhood) %>%
summarize(n = n()) %>%
top_n(n = 5, wt = n) %>%
arrange(desc(n))
## # A tibble: 5 x 2
## neighborhood n
## <chr> <int>
## 1 Downtown West 4409
## 2 Whittier 3328
## 3 Near - North 2256
## 4 Lyndale 2154
## 5 Jordan 2075
Summarize is really powerful, and you can do it multiple times (depending on your grouping structure) and use functions like:
1. mean() and median()
2. sd()
3. quantile(), min(), max()
4. first(), last()
5. n_distinct()
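Here’s a quick sketch of several of those summary functions on a tiny invented tibble (toy precincts and hours, not the real data):

```r
library(dplyr)

stops <- tibble(precinct = c(1, 1, 1, 2, 2),
                hour     = c(2, 14, 23, 9, 9))

by_precinct <- stops %>%
  group_by(precinct) %>%
  summarize(n       = n(),               # stops per precinct
            mean_hr = mean(hour),        # average hour of a stop
            max_hr  = max(hour),         # latest hour seen
            n_hours = n_distinct(hour))  # how many distinct hours

by_precinct
```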
For this next challenge, develop code that shows the number of stops for each police precinct for white men and women (sort by precinct).
Bonus challenge - which precinct has the most equal number of stops between men and women? What is the percentage?
#Part 1
police_data %>%
filter(race == "White") %>%
filter(gender == "Male" | gender == "Female") %>%
group_by(policePrecinct, gender) %>%
summarize(n = n()) %>%
arrange((policePrecinct))
## # A tibble: 10 x 3
## # Groups: policePrecinct [5]
## policePrecinct gender n
## <dbl> <chr> <int>
## 1 1 Female 404
## 2 1 Male 962
## 3 2 Female 1083
## 4 2 Male 2212
## 5 3 Female 443
## 6 3 Male 863
## 7 4 Female 400
## 8 4 Male 790
## 9 5 Female 1706
## 10 5 Male 2808
#Part 2
police_data %>%
filter(race == "White") %>%
filter(gender == "Male" | gender == "Female") %>%
group_by(policePrecinct, gender) %>%
summarize(n = n()) %>%
arrange((policePrecinct)) %>%
ungroup() %>%
group_by(policePrecinct) %>%
mutate(percent_of_total = n/sum(n)) %>%
mutate(percent_diff = max(percent_of_total)-min(percent_of_total))
## # A tibble: 10 x 5
## # Groups: policePrecinct [5]
## policePrecinct gender n percent_of_total percent_diff
## <dbl> <chr> <int> <dbl> <dbl>
## 1 1 Female 404 0.296 0.408
## 2 1 Male 962 0.704 0.408
## 3 2 Female 1083 0.329 0.343
## 4 2 Male 2212 0.671 0.343
## 5 3 Female 443 0.339 0.322
## 6 3 Male 863 0.661 0.322
## 7 4 Female 400 0.336 0.328
## 8 4 Male 790 0.664 0.328
## 9 5 Female 1706 0.378 0.244
## 10 5 Male 2808 0.622 0.244
A lot of dplyr syntax comes from SQL - a language designed to interact with databases, which are often just tables that are linked to each other through key columns. Imagine a scenario where you have a set of information about users - one table describes how much they’ve been on a certain website (number of hours total, number of days logged in, etc.), and another describes some of their personal attributes (gender, date of birth, login, username, etc.). You can imagine wanting to join the information from these two tables together to do certain analyses. Being able to join together different tables of data is a frequent occurrence, and really, really easy with dplyr syntax. You may be asking yourself why it isn’t easier just to have everything in one giant table… giant tables are slow and bulky. A linked database with many smaller tables is more agile. Most folks don’t want all the information all the time! For what it’s worth - dplyr can also play nice with databases… but that’s for another lesson. :)
For these examples and exercises, I’ve split the data into three different tables that are each unified by a different theme: incident information, demographic data and geographic data. This is supposed to mimic something you might see in a real database (or get from one of your colleagues). Let’s load and look at them.
incident_info = read_csv("./data/police_incident_info.csv", guess_max = 29000)
## Parsed with column specification:
## cols(
## idNum = col_character(),
## date = col_datetime(format = ""),
## problem = col_character(),
## citationIssued = col_character(),
## personSearch = col_character(),
## vehicleSearch = col_character()
## )
geography = read_csv("./data/police_geography.csv")
## Parsed with column specification:
## cols(
## idNum = col_character(),
## lat = col_double(),
## long = col_double(),
## policePrecinct = col_double(),
## neighborhood = col_character()
## )
demographics = read_csv("./data/police_demographics.csv")
## Parsed with column specification:
## cols(
## idNum = col_character(),
## preRace = col_character(),
## race = col_character(),
## gender = col_character()
## )
summary_data = read_csv("./data/police_data_summary.csv")
## Parsed with column specification:
## cols(
## policePrecinct = col_double(),
## neighborhood = col_character(),
## n_incidents = col_double()
## )
#glimpse(incident_info)
#Others too...
There are a couple main types of joins, depending on which combinations of tables you want:
Mutating Joins
Filtering Joins
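The distinction is easiest to see on two tiny invented tables (the ids and values here are made up, but the idNum key mirrors our real tables):

```r
library(dplyr)

incidents <- tibble(idNum   = c("A", "B", "C"),
                    problem = c("traffic", "suspicious", "traffic"))
places    <- tibble(idNum    = c("A", "B", "D"),
                    precinct = c(1, 4, 5))

# Mutating joins add columns from the other table
inner_join(incidents, places, by = "idNum")  # only ids present in both: A, B
left_join(incidents, places, by = "idNum")   # all incidents; C gets NA precinct

# Filtering joins only keep or drop rows - no new columns
semi_join(incidents, places, by = "idNum")   # incidents with a match: A, B
anti_join(incidents, places, by = "idNum")   # incidents without a match: C
```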
#Let's do this with pipe syntax - simple join where we're keeping everything that is common between the two dataframes
new = incident_info %>%
inner_join(geography, by = "idNum")
glimpse(new)
## Observations: 51,920
## Variables: 10
## $ idNum <chr> "17-000003", "17-000007", "17-000073", "17-000092…
## $ date <dttm> 2017-01-01 00:00:42, 2017-01-01 00:03:07, 2017-0…
## $ problem <chr> "suspicious", "suspicious", "traffic", "suspiciou…
## $ citationIssued <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ personSearch <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "…
## $ vehicleSearch <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "…
## $ lat <dbl> 44.96662, 44.98045, 44.94835, 44.94836, 44.97908,…
## $ long <dbl> -93.24646, -93.27134, -93.27538, -93.28135, -93.2…
## $ policePrecinct <dbl> 1, 1, 5, 5, 1, 1, 1, 2, 2, 4, 5, 1, 2, 1, 1, 1, 4…
## $ neighborhood <chr> "Cedar Riverside", "Downtown West", "Whittier", "…
#What happens when we want to join a smaller set of data (some summary data) onto a larger dataframe?
summary = new %>%
inner_join(summary_data, by = c("neighborhood", "policePrecinct"))
Fair warning - joins can get a lot more complicated than this when you have data that is represented in one table and not another, and it’s easy to get confused and lose data. This is just a primer! Check out the great book R for Data Science for more on joins!
Create two dataframes using joins. First, one that only contains records for stops during the top 3 busiest hours of stops for each neighborhood. Second, one that contains only records from outside the 3 busiest hours of stops for each neighborhood.
Hint: look up the hour() function from lubridate, and think about how to use semi_join() and anti_join().
top_hours = police_data %>%
mutate(hour = hour(date)) %>%
group_by(neighborhood, hour) %>%
summarize(n = n()) %>%
ungroup() %>%
group_by(neighborhood) %>%
top_n(n = 3, wt = n)
#Crimes in top 3 hours by neighborhood
police_data %>%
mutate(hour = hour(date)) %>%
semi_join(top_hours)
## Joining, by = c("neighborhood", "hour")
## # A tibble: 14,006 x 16
## X1 idNum date problem MDC citationIssued
## <dbl> <chr> <dttm> <chr> <chr> <chr>
## 1 6824 17-0… 2017-01-01 00:03:07 suspic… MDC <NA>
## 2 6827 17-0… 2017-01-01 00:37:58 traffic MDC <NA>
## 3 6828 17-0… 2017-01-01 00:46:48 traffic MDC <NA>
## 4 6829 17-0… 2017-01-01 00:48:46 suspic… MDC <NA>
## 5 6830 17-0… 2017-01-01 00:50:55 traffic MDC <NA>
## 6 6831 17-0… 2017-01-01 00:57:10 traffic MDC <NA>
## 7 6835 17-0… 2017-01-01 01:19:59 traffic MDC <NA>
## 8 6836 17-0… 2017-01-01 01:22:01 traffic MDC <NA>
## 9 6837 17-0… 2017-01-01 01:41:24 suspic… MDC <NA>
## 10 6838 17-0… 2017-01-01 01:45:09 suspic… MDC <NA>
## # … with 13,996 more rows, and 10 more variables: personSearch <chr>,
## # vehicleSearch <chr>, preRace <chr>, race <chr>, gender <chr>,
## # lat <dbl>, long <dbl>, policePrecinct <dbl>, neighborhood <chr>,
## # hour <int>
#crimes NOT in top 3 hours by neighborhood
police_data %>%
mutate(hour = hour(date)) %>%
anti_join(top_hours)
## Joining, by = c("neighborhood", "hour")
## # A tibble: 37,914 x 16
## X1 idNum date problem MDC citationIssued
## <dbl> <chr> <dttm> <chr> <chr> <chr>
## 1 6823 17-0… 2017-01-01 00:00:42 suspic… MDC <NA>
## 2 6825 17-0… 2017-01-01 00:23:15 traffic MDC <NA>
## 3 6826 17-0… 2017-01-01 00:33:48 suspic… MDC <NA>
## 4 6832 17-0… 2017-01-01 01:05:50 traffic MDC <NA>
## 5 6833 17-0… 2017-01-01 01:09:13 suspic… MDC <NA>
## 6 6834 17-0… 2017-01-01 01:15:01 traffic other <NA>
## 7 6839 17-0… 2017-01-01 01:51:02 suspic… MDC <NA>
## 8 6841 17-0… 2017-01-01 01:53:42 traffic MDC <NA>
## 9 6842 17-0… 2017-01-01 02:04:13 suspic… MDC <NA>
## 10 6843 17-0… 2017-01-01 02:09:42 traffic MDC <NA>
## # … with 37,904 more rows, and 10 more variables: personSearch <chr>,
## # vehicleSearch <chr>, preRace <chr>, race <chr>, gender <chr>,
## # lat <dbl>, long <dbl>, policePrecinct <dbl>, neighborhood <chr>,
## # hour <int>
The other great thing about dplyr and the tidyverse as a whole is that you can pipe modified dataframes directly into ggplot2, and you can also do piped dplyr operations within ggplot2.
For example, let’s say that we wanted to do some mapping (we have lat lon info for all the stops, so this makes sense!).
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
register_google(key = "YOUR_API_KEY") #swap in your own Google Maps API key
loc = c(police_data$lat[1], police_data$long[1])
our_map = get_map(location = "Minneapolis", zoom = 12)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Minneapolis&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Minneapolis&key=xxx
#Tooooo much data
ggmap(our_map) +
geom_point(data = police_data, aes(x = long, y = lat, color = race),
size = 1, alpha = 0.5)
## Warning: Removed 560 rows containing missing values (geom_point).
#Let's do filtering within ggplot
ggmap(our_map) +
stat_density_2d(data = police_data %>%
filter(race == "Black" | race == "White" | race == "East African"),
aes(x = long, y = lat, color = NULL, fill = race),
geom = "polygon",
alpha = 0.5) +
facet_wrap(~ race)
## Warning: Removed 233 rows containing non-finite values (stat_density2d).
#Could also add centroids
ggmap(our_map) +
stat_density_2d(data = police_data %>%
filter(race == "Black" | race == "White" | race == "East African"),
aes(x = long, y = lat, color = NULL, fill = race),
geom = "polygon",
alpha = 0.1) +
geom_point(data = police_data %>%
filter(race == "Black" | race == "White" | race == "East African") %>%
group_by(race) %>%
summarize(avg_lat = mean(lat),
avg_lon = mean(long)),
aes(x = avg_lon, y = avg_lat, color = race)) +
facet_wrap(~ race)
## Warning: Removed 233 rows containing non-finite values (stat_density2d).
#And we can also pipe directly into ggplot (though we can't do this for ggmap)
police_data %>%
mutate(day = day(date)) %>%
group_by(policePrecinct, day) %>%
summarize(n = n()) %>%
ggplot(aes(x = factor(policePrecinct), y = n)) +
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
geom_jitter(width = 0.1, alpha = 0.5) +
# geom_dotplot(binaxis = "y", stackdir = 'center', dotsize = 0.5) +
theme_classic() +
xlab("Police Precinct") +
ylab("Number of stops per day")
Put it all together now! Determine which three precincts had the worst racial mis-match (when preRace didn’t line up with actual race). Plot a map of all the stops that involved racial mis-matches for the worst precinct, colored by the actual race of the individual stopped.
police_data %>%
filter(preRace != "Unknown" & race != "Unknown") %>%
mutate(mismatch = ifelse(preRace == race, FALSE, TRUE)) %>%
group_by(policePrecinct) %>%
count(mismatch) %>%
mutate(perc = n/sum(n)) %>%
filter(mismatch == TRUE) %>%
arrange(desc(perc))
## # A tibble: 5 x 4
## # Groups: policePrecinct [5]
## policePrecinct mismatch n perc
## <dbl> <lgl> <int> <dbl>
## 1 5 TRUE 338 0.0801
## 2 2 TRUE 106 0.0484
## 3 4 TRUE 78 0.0308
## 4 3 TRUE 61 0.0283
## 5 1 TRUE 36 0.0172
to_map = police_data %>%
filter(policePrecinct %in% c(5)) %>%
mutate(mismatch = ifelse(preRace == race, FALSE, TRUE)) %>%
filter(mismatch == TRUE)
mismatch_map = get_map(location = "Minneapolis", zoom = 11)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Minneapolis&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Minneapolis&key=xxx
ggmap(mismatch_map) +
geom_point(data = to_map, aes(x = long, y = lat, color = race), alpha = 0.5, size = 1) +
facet_wrap(~ race)